Fix CloudFetch goroutine leak that retains Arrow buffers after Close#357
Merged
Conversation
The cloudFetchDownloadTask goroutine sends its result on an unbuffered channel without respecting context cancellation. When the iterator is closed mid-download (timeout, error, or consumer dropping the result set), in-flight goroutines that have already finished their HTTP read sit blocked on the send forever, pinning the downloaded buffer in the Go heap. Under bursts of large CloudFetch queries this manifests as a multi-GiB heap plateau that only releases on process restart. Fix: route the channel send through a helper that selects on ctx.Done(), so cancellation via task.cancel() (already issued from cloudIPCStreamIterator.Close) drains the goroutine and frees its buffer. Closes #356 Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
gopalldb
approved these changes
May 12, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #356.
Under high CloudFetch concurrency (≥6 simultaneous downloads), in-flight
cloudFetchDownloadTaskgoroutines could leak when the consumer closed the iterator before draining all results. Each leaked goroutine pinned a downloaded chunk in the Go heap, producing the multi-GiB heap plateau described in the issue that only released on process restart.Root cause
cloudFetchDownloadTask.Runsends the download result on an unbuffered channel without honoring context cancellation:Sequence that triggers the leak:
cloudIPCStreamIterator.NextschedulesMaxDownloadThreads(default 10) tasks concurrently.iterator.Close().Closecallstask.cancel()on each remaining task. But context cancellation does not unblock an in-flight channel send — the goroutines stay blocked forever, retaining their buffers.In v1.7.1 (the version the reporter is on) the goroutine had already decoded the bytes into Arrow records before the send, so the leaked memory was Arrow-allocator buffers — matching the stack trace in the issue:
In the current code (v1.11.0) the decode happens later in
batchIterator.Next, so the leak is the raw decompressedbufinstead — same shape, smaller per-goroutine retention, same plateau pattern.Fix
Route every channel send through a helper that selects on
ctx.Done():cloudIPCStreamIterator.Closealready callstask.cancel()for every queued task, so cancellation now correctly drains stuck goroutines and lets their buffers be GC'd.Test plan
TestCloudFetchIterator_CloseReleasesInFlightDownloadsreproduces the leak: spawnsMaxDownloadThreadsconcurrent downloads, releases them after the iterator has consumed only the first, then callsClose()and asserts that nocloudFetchDownloadTask.Rungoroutines remain.main(~9 leaked goroutines afterClose).go test ./...passes locally.go vetandgofmtclean.Who is affected
Any user with CloudFetch enabled (default since v1.7.0) whose query context can be cancelled or whose result set can be abandoned mid-stream — i.e., basically everyone running large CloudFetch queries with timeouts.
This pull request and its description were written by Isaac.